fix: centralize UTF-8 file I/O for Windows compatibility#45

Merged
ar7casper merged 11 commits into knostic:release/2026-05-10 from joshbouncesecurity:fix/issue16-09-utf8-io
May 10, 2026

Conversation

@joshbouncesecurity
Contributor

Summary

Bare open() calls use the system encoding (cp1252 on Windows), causing "charmap codec can't decode" errors on any target codebase containing non-ASCII characters (e.g. curly quotes U+2019, accented characters, CJK).

Adds utilities/file_io.py with centralized UTF-8 helpers (open_utf8, read_json, write_json, run_utf8) and migrates ~190 bare open() call sites across libs/openant-core/ (core, parsers, utilities, openant CLI, top-level scripts) to specify encoding="utf-8" explicitly. Also sets encoding/errors on the docker_executor subprocess.run that captures container stdout/stderr as text.

A regression test scans non-test Python files for any bare open() call without an encoding= argument and fails if a regression reappears.

Addresses item 9 from #16 (does not close the issue).
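The helper surface described above can be sketched roughly as follows. This is a minimal sketch based only on the names in this PR description (open_utf8, read_json, write_json, run_utf8); the real utilities/file_io.py may differ in signatures and details:

```python
"""Hedged sketch of the utilities/file_io.py helpers described in this PR."""
import json
import subprocess


def open_utf8(path, mode="r", **kwargs):
    # Force UTF-8 regardless of the platform locale (cp1252 on Windows).
    return open(path, mode, encoding="utf-8", **kwargs)


def read_json(path):
    with open_utf8(path) as f:
        return json.load(f)


def write_json(path, data, **kwargs):
    with open_utf8(path, "w") as f:
        json.dump(data, f, **kwargs)


def run_utf8(*args, **kwargs):
    # Only inject encoding/errors when the caller asked for text mode
    # (text=True or its universal_newlines alias) and did not already
    # specify an encoding of their own.
    wants_text = kwargs.get("text") or kwargs.get("universal_newlines")
    if wants_text and "encoding" not in kwargs:
        kwargs["encoding"] = "utf-8"
        kwargs["errors"] = "replace"
    return subprocess.run(*args, **kwargs)
```

The run_utf8 guard mirrors the behavior stated in the test plan: explicit encoding= from the caller is respected, and binary-mode subprocess calls are left untouched.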

Test plan

  • Helper unit tests: non-ASCII round-trip via open_utf8, read_json, write_json.
  • run_utf8 captures non-ASCII subprocess output (incl. universal_newlines=True alias).
  • run_utf8 only injects encoding/errors when caller asks for text mode; explicit encoding= is respected.
  • Regression: no bare open( in non-test code (test scans libs/openant-core/ excluding tests/ and the helper).
  • Existing tests pass: pytest tests/ -> 96 passed, 22 skipped (env-dependent).
  • Manual: run a parse against a repo with non-ASCII content on a cp1252 console -- no charmap errors.
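The bare-open( regression scan from the test plan can be approximated with an AST walk like the one below. This is a hedged sketch: the actual test's file discovery, exclusion rules, and edge-case handling may differ, but the core check (an open() call with no encoding= keyword) is the same idea:

```python
"""Sketch of a regression scan for bare open() calls without encoding=."""
import ast


def bare_open_calls(source: str, filename: str):
    """Return (filename, lineno) pairs for open() calls lacking encoding=."""
    hits = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "open"
            and not any(kw.arg == "encoding" for kw in node.keywords)
        ):
            # Binary-mode opens ("rb"/"wb") legitimately omit encoding=.
            mode = node.args[1] if len(node.args) > 1 else None
            if isinstance(mode, ast.Constant) and "b" in str(mode.value):
                continue
            hits.append((filename, node.lineno))
    return hits
```

A test built on this helper would walk the non-test files under libs/openant-core/ and assert the returned list is empty.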

joshbouncesecurity and others added 2 commits May 4, 2026 21:24
Bare open() calls use the system encoding (cp1252 on Windows), causing
'charmap codec can't decode byte ...' errors when parsing repositories
containing non-ASCII characters such as curly quotes.

Adds utilities/file_io.py with open_utf8, read_json, write_json, and
run_utf8 helpers, and migrates ~190 bare open() call sites across
libs/openant-core/ (core, parsers, utilities, openant CLI, top-level
scripts) to specify encoding="utf-8" explicitly. Also sets
encoding/errors on the docker_executor subprocess.run that captures
container stdout/stderr as text.

Includes a regression test that scans non-test code for any bare open()
call without an encoding= argument and fails if a regression reappears.

Addresses item 9 from #16.
… UTF-8

Round 1 review fixes for PR knostic#45:

- application_context.py, ast_parser.py, dataset_enhancer.py, report/__main__.py,
  report/generator.py: pass encoding='utf-8' on every Path.read_text() /
  write_text() call. The previous migration only covered open() calls; pathlib's
  text helpers also default to the system locale on Windows (cp1252) and crash
  on non-ASCII source code.
- parsers/{c,go,javascript,php,ruby}/test_pipeline.py: pass
  encoding='utf-8', errors='replace' on subprocess.run(text=True) invocations
  of parser binaries and CodeQL. Only docker_executor.py was migrated before;
  these other call sites had the same Windows cp1252 hazard.
- tests/test_file_io.py: extend regression scan with two new asserts —
  Path.read_text/write_text without encoding=, and subprocess.run(text=True)
  without encoding=. Refactored the call-walking logic into a shared helper.

All 14 file_io tests pass; full tests/ suite: 98 passed, 22 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@joshbouncesecurity
Contributor Author

Manual verification

  • Windows cp1252 console: scan a target repo containing non-ASCII content (e.g., Japanese strings, accented chars in comments). Confirm no UnicodeDecodeError / charmap codec can't decode errors.
  • Linux/macOS: same scan succeeds (no regression).
  • Subprocess paths: docker_executor capturing non-ASCII output: no decode crash.
  • Path-method I/O: any Path.read_text() / write_text() calls covered by the round-2 follow-up still work cross-platform.
  • git grep 'open(' -- 'libs/openant-core/**/*.py' (excluding tests + the helper module; git grep takes pathspecs, not grep's --include): no bare open( calls remain. The new regression test enforces this in CI.
  • JSON round-trip: utilities/file_io.py write_json then read_json of a dict with {"name": "héllo 日本語"} returns the same dict (not escaped sequences).

@joshbouncesecurity
Contributor Author

Local test results

Verified the migration on Windows by static-grepping the worktree and round-tripping non-ASCII content through utilities.file_io.

Commands run:

# 1. No bare open() in non-test code
git grep -nE "open\([^)]*\)" -- 'libs/openant-core/**/*.py' \
    ':!libs/openant-core/tests/**' ':!libs/openant-core/utilities/file_io.py' \
    | grep -v "encoding=" | grep -v "open_utf8"
# Two hits, both legitimate:
#   parsers/zig/function_extractor.py:43:  with open(full_path, "rb") as f:   <- binary mode, no encoding needed
#   utilities/agentic_enhancer/prompts.py:68:  - Files: open() with external paths   <- text inside a docstring/prompt

# 2. Round-trip via utilities.file_io
python check45.py

Output:

OK default round-trip preserves non-ASCII via \uXXXX escapes
OK ensure_ascii=False writes raw UTF-8 bytes; read_json still round-trips
OK open_utf8 round-trips curly quote on this Windows shell

Outcome:

  • No bare open() in non-test code (only legitimate "rb" and a docstring fragment) ✅
  • read_json / write_json round-trip a dict with {"name": "héllo 日本語", "list": [..., "’"]} byte-identically ✅
  • write_json(..., ensure_ascii=False) writes raw UTF-8 bytes (verified e6 97 a5 e6 9c ac e8 aa 9e for the JP run); read_json still parses it back to the same dict ✅
  • open_utf8 round-trips a U+2019 curly quote on this Windows shell (default cp1252 console) ✅
  • Did not run a full openant scan against a non-ASCII codebase end-to-end — would have required an LLM call. The above confirms the helper works on Windows; the migration's call-site coverage is enforced by the new regression test.
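The round-trip behaviour verified above can be reproduced with plain json plus explicit encoding, standing in for the PR's write_json/read_json wrappers (which are assumed to be thin wrappers over these calls):

```python
"""Self-contained reproduction of the JSON round-trip checks above."""
import json
from pathlib import Path

data = {"name": "héllo 日本語"}
path = Path("roundtrip.json")

# Default json.dump escapes non-ASCII as \uXXXX but still round-trips.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f)
with open(path, encoding="utf-8") as f:
    assert json.load(f) == data

# ensure_ascii=False writes raw UTF-8 bytes instead of \uXXXX escapes.
with open(path, "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False)
raw = path.read_bytes()
assert "日本語".encode("utf-8") in raw  # e6 97 a5 e6 9c ac e8 aa 9e
with open(path, encoding="utf-8") as f:
    assert json.load(f) == data

path.unlink()
```

The byte sequence in the comment matches the one reported in the outcome list above for the Japanese string.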

@joshbouncesecurity joshbouncesecurity marked this pull request as ready for review May 7, 2026 11:22
Collaborator

@ar7casper ar7casper left a comment


One thing I'd like your call on before merging: utilities/file_io.py defines open_utf8 /
read_json / write_json / run_utf8, but the migration inlines encoding="utf-8" at
every call site rather than routing through these helpers. So as it stands, the helper
module is only imported by its own test file — production code never touches it.

Three ways to go:

  1. Adopt the helpers in this PR — single chokepoint to change later (e.g. if we ever
    want to switch to errors="strict" or add logging). Bigger churn though, and it's literally
    one keyword argument we'd be hiding.
  2. Keep the helpers as the recommended pattern for new code — current state. Fine, but
    they'll quietly rot unless we tell contributors to reach for them. Worth a one-liner in
    libs/openant-core/CLAUDE.md if you go this way.
  3. Drop the helpers and rely on inlined encoding="utf-8" + the regression scanners as
    the contract. Less surface area, one less thing to import.

joshbouncesecurity and others added 2 commits May 10, 2026 13:26
The previous commit inlined encoding="utf-8" at every call site, but
production code already used the file_io helpers (read_json, write_json,
open_utf8) on master. This restores that pattern so the helpers are
actually exercised and remain the single chokepoint for I/O policy.

All four regression scanner tests (bare open, pathlib text I/O, dot
open, text-mode subprocess) continue to pass.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng failure

Bare open() without encoding= uses cp1252 on Windows, which would crash
on non-ASCII dataset content. These calls had read_json on master.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joshbouncesecurity joshbouncesecurity marked this pull request as draft May 10, 2026 10:35
joshbouncesecurity and others added 7 commits May 10, 2026 13:38
Three files that used helpers on master still had inline encoding after
the previous fix pass: application_context.py (save_context write),
parser_adapter.py (diff_filter report write), and
agentic_enhancer/repository_index.py (load_index_from_file read).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace inline open/json.load and open/json.dump patterns with
read_json/write_json from utilities/file_io.py in the ten files
added by this branch that were still using bare I/O:

- core/diff_filter.py
- core/schemas.py (StepReport.write)
- openant/cli.py (all json.load calls in cmd_*)
- report/__main__.py (cmd_summary, cmd_disclosures)
- report/generator.py (merge_dynamic_results, generate_all)
- utilities/context_enhancer.py (checkpoint read/write paths)
- parsers/zig/{call_graph_builder,function_extractor,repository_scanner,unit_generator}.py

All 15 regression-scanner tests pass after this change.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace filepath.read_text + json.loads with read_json in
check_for_manual_override, keeping read_text only for the
.md parsing branch which needs the raw text for regex.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- reporter.py: open_utf8 for markdown report writes
- docker_executor.py: open_utf8 for Dockerfile/compose/script writes
- parsers/*/test_pipeline.py: read_json/write_json for all JSON
  reads and writes; open_utf8 for text file writes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- subprocess.run(..., encoding="utf-8", errors="replace") →
  run_utf8(...) in all test_pipeline.py files and docker_executor.py
- Path.write_text(encoding="utf-8") → open_utf8 + f.write() in
  report/__main__.py and report/generator.py
- Path.read_text(encoding="utf-8") → open_utf8 + f.read() in
  report/generator.py (load_prompt)

Remaining read_text(errors="ignore/replace") calls in parsers and
application_context.py stay inline — open_utf8 does not support the
errors= override, and these reads are intentionally lenient on
non-UTF-8 source files.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
open_utf8 passes **kwargs through to open(), so errors="ignore" and
errors="replace" work correctly. Replace all remaining Path.read_text
inline uses in application_context.py, ast_parser.py, and
dataset_enhancer.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor Author

@joshbouncesecurity joshbouncesecurity left a comment


Went with option 1. All production code (and the test pipelines) now routes through the helpers — read_json, write_json, open_utf8, and run_utf8 — rather than inlining encoding="utf-8" at every call site. The four regression scanner tests in test_file_io.py enforce this going forward and will catch any backslide on bare open(), Path.read_text/write_text, or subprocess.run(text=True) without the helpers.

The only remaining encoding="utf-8" references in production code are calls that also pass errors="ignore" or errors="replace" to read arbitrary source files — open_utf8 passes **kwargs through to open(), so those use open_utf8(path, errors="replace") etc. rather than read_text.
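The **kwargs passthrough described above can be illustrated as follows. open_utf8 here is a hypothetical thin wrapper matching the behavior the PR describes, so errors="replace" reaches the underlying open() call and invalid bytes decode to U+FFFD instead of raising:

```python
"""Sketch of open_utf8's errors= passthrough for lenient source reads."""
from pathlib import Path


def open_utf8(path, mode="r", **kwargs):
    # Hypothetical wrapper; forwards errors= (and any other kwargs) to open().
    return open(path, mode, encoding="utf-8", **kwargs)


# A file containing a byte that is invalid UTF-8 (raw cp1252 0xE9, "é").
p = Path("lenient.txt")
p.write_bytes(b"caf\xe9")

with open_utf8(p, errors="replace") as f:
    text = f.read()

# The invalid byte becomes U+FFFD rather than raising UnicodeDecodeError.
assert text == "caf\ufffd"
p.unlink()
```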

@joshbouncesecurity joshbouncesecurity marked this pull request as ready for review May 10, 2026 11:22
@joshbouncesecurity
Contributor Author

@ar7casper I think this is ok now

joshbouncesecurity added a commit to joshbouncesecurity/OpenAnt that referenced this pull request May 10, 2026
Adds test_no_bare_path_calls_in_typescript_analyzer, a cross-platform scanner
that greps typescript_analyzer.js for any path.relative/resolve/join() call not
accompanied by toPosixPath() within a ±6-line window. This mirrors the PR knostic#45
pattern (test_no_bare_open, test_no_bare_pathlib_text_io, etc.) that prevents
contributors from reintroducing encoding antipatterns.

Scoped to typescript_analyzer.js only — other JS parser files legitimately call
path.X() without toPosixPath() since they do not interact with ts-morph.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@joshbouncesecurity joshbouncesecurity changed the base branch from master to release/2026-05-10 May 10, 2026 12:37
@ar7casper ar7casper merged commit 8bad38d into knostic:release/2026-05-10 May 10, 2026
8 checks passed
@joshbouncesecurity joshbouncesecurity deleted the fix/issue16-09-utf8-io branch May 10, 2026 12:48